Identification of hub genes based on integrated analysis of single-cell and microarray transcriptome in patients with pulmonary arterial hypertension

Background Pulmonary arterial hypertension (PAH) is a devastating chronic cardiopulmonary disease without an effective therapeutic approach. The underlying molecular mechanism of PAH remains largely unexplored at single-cell resolution. Methods Single-cell RNA sequencing (scRNA-seq) data from the Gene Expression Omnibus (GEO) database (GSE210248) was included and analyzed comprehensively. Additionally, microarray transcriptome data including 15 lung tissue from PAH patients and 11 normal samples (GSE113439) was also obtained. Seurat R package was applied to process scRNA-seq data. Uniform manifold approximation and projection (UMAP) was utilized for dimensionality reduction and cluster identification, and the SingleR package was performed for cell annotation. FindAllMarkers analysis and ClusterProfiler package were applied to identify differentially expressed genes (DEGs) for each cluster in GSE210248 and GSE113439, respectively. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genome (KEGG) were used for functional enrichment analysis of DEGs. Microenvironment Cell Populations counter (MCP counter) was applied to evaluate the immune cell infiltration. STRING was used to construct a protein-protein interaction (PPI) network of DEGs, followed by hub genes selection through Cytoscape software and Veen Diagram. Results Nineteen thousand five hundred seventy-six cells from 3 donors and 21,896 cells from 3 PAH patients remained for subsequent analysis after filtration. A total of 42 cell clusters were identified through UMAP and annotated by the SingleR package. 10 cell clusters with the top 10 cell amounts were selected for consequent analysis. Compared with the control group, the proportion of adipocytes and fibroblasts was significantly reduced, while CD8+ T cells and macrophages were notably increased in the PAH group. MCP counter revealed decreased distribution of CD8+ T cells, cytotoxic lymphocytes, and NK cells, as well as increased infiltration of monocytic lineage in PAH lung samples. Among 997 DEGs in GSE113439, module 1 with 68 critical genes was screened out through the MCODE plug-in in Cytoscape software. The top 20 DEGs in each cluster of GSE210248 were filtered out by the Cytohubba plug-in using the MCC method. Eventually, WDR43 and GNL2 were found significantly increased in PAH and identified as the hub genes after overlapping these DEGs from GSE210248 and GSE113439. Conclusion WDR43 and GNL2 might provide novel insight into revealing the new molecular mechanisms and potential therapeutic targets for PAH.


Introduction
Pulmonary artery hypertension (PAH) is a chronic severe progressive cardiopulmonary disease characterized by pulmonary arterial pressure elevation and right ventricular hypertrophy [1].The prevalence of PAH is 10.6 per million adults in America nowadays [2].Despite the benefits of treatments targeting nitric oxide, prostacyclin, and endothelin pathways to delayPAH progression and improve survival, only lung transplantation is considered a curative approach [3].PAH remains an incurable chronic disease with a poor prognosis [4].Vasoconstriction, obstructive pulmonary vasculopathy characterized by hyperproliferation and anti-apoptosis phenotype of PASMCs, excessive fibrosis, inflammation, thrombosis, and altered mitochondrial metabolic all participated in the mechanisms implicated in PAH [5].However, there remains largely unexplored on the pathogenesis of PAH.Therefore, systematic analysis of the function of different cell types in the pulmonary tissue of PAH patients might help deepen understanding of the pathological mechanism of PAH.
Microarray transcriptome has been increasingly and widely used to examine gene expression in PAH [6,7].However, data of microarray transcriptome represents the average gene expression amounts of various cells at the whole level of tissue [8].Lung tissues contain various cell types, including smooth muscle cells, endothelial cells, fibroblasts, immune cells, inflammatory cells, etc.They play different roles throughout the development of PAH.Currently, a novel single-cell RNA sequencing (scRNA-seq) technology is emerging to investigate cell heterogeneity, characterize each cell subpopulation, and putative intracellular communication [9,10].This innovative technology has advanced our understanding of PAH at the cell subpopulation level.scRNA-seq has been carried out in lung samples of both PAH rodent models and PAH patients.Previous research reported NF-κB signaling activation in immune cells of monocrotaline and hypoxia-induced PH rat model [11].Based on the scRNA-seq data of lung ECs from hypoxic pulmonary hypertension mice, Julie and his colleagues indicated CD74 was involved in the regulation of endothelial cell proliferation and barrier integrity [12].However, scRNAseq data on PAH is relatively small and still in its infancy currently.
In the present study, integrated bioinformatics analysis of scRNA-seq and microarray transcriptome data from the GEO dataset was analyzed to identify the hub genes in PAH.Differentially expressed genes (DEGs) from GSE210248 and GSE113439 were identified and common DEGs were selected.Protein-protein interaction network (PPI) network was constructed using the aforementioned DEGs, followed by hub gene selection through Cytoscape software.Finally, GNL2 and WDR43 were identified as hub genes, which might provide new insight into the pathogenesis of PAH and act as novel candidates and therapeutic targets for PAH.

Data acquisition
Data were all processed and analyzed by R software (Version 4.3.0).Both scRNA-seq (GSE210248) and microarray transcriptome (GSE113439) data were obtained from the Gene Expression Omnibus (GEO, http:// www.ncbi.nlm.nih.gov/ geo/) database [13] and downloaded through the GEO query package (Version 2.68.0).GSE210248 and GSE113439 were selected in the current research because the samples in the two datasets were obtained from the lung/pulmonary arteries of participants with pulmonary hypertension, rather than the PAH rodent model.Additionally, GSE11339 has a relatively large The details of the two datasets enrolled in this study were listed in Table 1.The GSE210248 scRNA-seq data and GSE113439 array data were generated on GPL20301 Illumina Hiseq 4000 and GPL6244 Affymetrix Human Gene 1.0 ST Array platform, respectively.GSE210248 data included pulmonary arteries from 3 PAH patients and 3 healthy donor control.The dataset contains 19,576 cells from the control group and 22,704

Processing of scRNA-seq data
Seurat package (Version 4.3.0)was used for quality control.Cells with 200-2500 genes and < 5% mitochondrial genes were selected for consequent analysis.A total of the 19,576 cells in control group and 21,896 cells in the PAH group were screened out for analysis.Data of genes was further normalized using the "LogNormalize" method and further scaled.Then, the top 2000 highly variable genes (HVGs) were identified by the FindVariableFetures function with the "vst" method.Subsequently, principal component analysis (PCA) was applied to identify significant principal components (PCs), and the p-value was visualized using the JackStraw and ScoreJackStraw functions.Uniform manifold approximation and projection (UMAP) was utilized for dimensionality reduction with 20 PCs and cluster identification across these cells."Harmony" R package was used for batch correction to avoid the batch effect of sample identity which might disrupt the downstream analysis [14].SingleR package (Version 2.2.0) [15] was utilized for cell annotation according to the reference datasets HumanPri-maryCellAtlasData [16] and BlueprintEncodeData [17].FindAllMarkers analysis with |log 2 fold change (FC)|> 1 and adjusted p value < 0.05 were performed to screen out the differentially expressed genes (DEGs) for each cell cluster.scRNAtoolVis package (Version 0.0.5) was performed to display the top DEGs and visualized by jjvolcano.

Processing of microarray transcriptome data
DEGs between the control and PAH groups with an adjusted p value < 0.05 were screened out using the limma package (Version 3.56.2) [18].All DEGs were visualized using the volcano plot and the top 50 DEGs were visualized through the heatmap plot in the "ggplot2" package.

Microenvironment Cell Populations counter (MCP counter)
The infiltration of microenvironment immune cells including B lineage, CD8 T cells, cytotoxic lymphocytes, endothelial cells, monocytic lineage, myeloid dendritic cells, neutrophils, NK cells, and T cells was quantified through the MCP counter (Version 1.2.0) based on scRNA-seq data [22].

PPI network construction and identification of hub genes
DEGs in scRNA-seq and microarray transcriptome data were screened by FindAllMarkers analysis in the Seurat package and the limma package.Subsequently, proteinprotein interaction (PPI) networks were constructed for the prediction of internal connection among the picked DEGs using the STRING database (Version 11.5, https:// string-db.org/) with an interaction conference score set to 0.4 [23].Then, hub genes were screened out and network visualization was performed using Cytoscape software (Version 3.10.0)[24].

ScRNA profiling in PAH
The scRNA-seq data of GSE210248 was downloaded from the GEO database and analyzed through R software.In general, 42,280 cells comprising 19,576 cells from donors (control) and 22,704 cells from PAH patients were included.After filtrating improper gene amounts or mitochondrial genes ≥ 5%, 19,576 cells from donors and 21,896 cells from PAH patients remained.Figure 1A presented the expression characteristics of each sample.As shown in Fig. 1B, nCount_RNA (the number of unique molecular identifiers) was positively correlated with nFeature_RNA (the number of genes) with a correlation coefficient of 0.93. Figure 1C displayed and labeled the top 10 HVGs: SFTPC, CCL21, SFTPA1, IGKC, STFPA2, STFPB, PGC, TPSB2, TPSAB1, S100A12.The top 20 PCs identified by PCA were visualized by JackStrawPlot (Fig. 1D).In addition, the top 10 DEGs in each cluster were presented by heatmap and labeled in yellow (Fig. 1E).

Cell clusters identification in scRNA-seq
Forty-one thousand four hundred seventy-two cells were divided into 42 cell clusters and visualized through UMAP, and cell annotation was performed by SingleR package (Fig. 2A).The number of cells in some clusters was too small, therefore, 10 clusters (Adipocytes, CD4+ T cells, CD8+ T cells, chondrocytes, endothelial cells, epithelial cells, fibroblasts, macrophages, monocytes, and NK cells) with the top 10 cell amounts were selected for subsequent analysis.Cell numbers of each cluster were shown in Table 2.The distribution of each cluster in the selected 10 clusters was presented in Fig. 2B, and the results of cell cluster distribution grouped by control and PAH were displayed in Fig. 2C.Additionally, the number and proportion of cells in each sample were exhibited in Fig. 2D.In comparison with the control group, the proportion of adipocytes (39.4% vs. 5.6%) and fibroblasts (18.1% vs. 2.5%) was significantly reduced in the PAH group, while CD8+ T cell (3.3% vs. 50.0%)and macrophages (4.3% vs. 14.8%) were notably increased in PAH lung tissues compared with donors.

DEGs of each cluster in GSE210248
The DEGs of each cluster between the control and PAH groups were identified using the FindMarkers function.
The top 10 DEGs in each cluster were listed in Table 3.For instance, GZMB, ATP5E, GNB2L17, GKN2, ATP5E6, SLPI5, SELM2, SFTPC3, SFTPC2, and GNB2L11 were the most significant DEG in NK cells, macrophages, fibroblasts, monocytes, epithelial_cells, endothelial_cells, chondrocytes, CD8+ T-cells, CD4+ T-cells, and adipocytes based on adjusted p-value.scRNAtoolVis package was further performed to intuitively illustrate the top 5 upregulated and the top 5 downregulated genes in the PAH group compared with the control group and visualized by jjvolcano (Fig. 3).

DEGs of pulmonary tissue in GSE113439
The

Different immune cell infiltration of pulmonary tissue in GSE113439
Using microarray transcriptome data from GSE113439, the MCP counter was utilized to evaluate the immune cell infiltration in control and PAH lung samples.As shown in Fig. 5A, statistically decreased distribution of CD8+ T cells, cytotoxic lymphocytes, and NK cells were found in lung tissues of PAH patients compared

Protein-protein interaction network (PPI) network and common DEGs identification in GSE210248 and GSE113439
The PPI network of DEGs from GSE113439 was generated by STRING (Fig. 6).The PPI network consisted of 945 nodes and 7266 edges in 997 DEGs.Then, the PPI network of the DEGs in each cluster of GSE210248  monocytes with the adjusted p-value < 0.05.A Venn diagram was drawn to screen out the common hub genes from GSE210248 and GSE113439.As shown in Fig. 8, a series of genes were identified through overlapping DEGs from each cluster in GSE210248 and DEGs from GSE113439.The details of overlapped genes were listed in Table 4.

Hub genes identification in PAH
Considering the amounts of selected common DEGs were relatively large.MCODE was utilized for the selection of candidate hub genes from the PPI network of 997 DEGs in GSE113439.Module 1 with the highest score (68 nodes and 1142 edges) was screened out (Fig. 9A).The centralities of the candidate genes in module 1 were evaluated by the CentiScaPe plug-in and the details were shown in Table 5.Additionally, the CytoHubba plugin was used for ranking nodes in module 1 using MCC methods.The MCODE score of each gene is also summarized in Table 5. Cytohubba plug-in was performed to further simplify these hub genes and pick out the most critical genes using the MCC method.No overlapped genes were found between DEGs from adipocytes, CD4+ T cells, and DEGs from GSE113409.Therefore, the top 20 genes were filtered out in the remaining 8 clusters of GSE210248 and presented in Fig. 9B-I.Less than 20 DEGs in chondrocytes and epithelial cells were presented because there are only 15 DEGs in chondrocytes and 14 DEGs in epithelial cells were included in the network of Cytoscape software.We further screened out the common hub genes using data from module 1 and the top 20 genes in these clusters.As shown in Fig. 10, WDR43 in chondrocytes and GNL2 in CD8+ T cells were finally identified as the most significant genes in PAH.Furthermore, we detected the expression of WDR43 and GNL2 in the lung samples of 15 PAH patients and 11 control subjects in GSE113439 and found significantly increased WDR43 and GNL2 expression (Fig. 11).

Discussion
The present study for the first time indicated WDR43 and GNL2 might act as key genes involved in the pathogenesis of PAH, providing a novel potential underlying mechanism of PAH.In the current study, common DEGs were screened out using integrated analysis of scRNAseq and microarray transcriptome through the limma package and Seurat package in R software.Subsequently, the PPI network of DEGs was constructed using the STRING website.Then, Cytoscape software was utilized to screen out the hub genes in the cluster of GSE210248 and GSE113439.Ultimately, we identified two hub genes (WDR43 and GNL2) in PAH through a series of bioinformatics analyses.MCP counter illustrated dysregulated landscape of immune cells in lung tissues of PAH patients, which is consistent with previous reports.Marlene reviewed immune dysregulation in PAH and how immunemediated vascular injury promoted PAH development [25].For instance, circulating autoantibodies against endothelial cells might enhance the apoptosis of endothelial cells in PAH [26].T cells and NK cells were considered as beneficial factors during the pathogenesis of PAH [27,28].Additionally, the role of perivascular macrophages has received extensive attention from  researchers.Widespread Cd68+ macrophages were detected in occlusive plexiform lesions in clinical and experimental PAH models [29].Inactivation or deletion of macrophages could prevent the development of PAH [30].More researches need to be carried out to further explore the role of various immune cells in PAH and the underlying mechanisms.The WD40 repeat (WDR) domain is the most abundant protein interaction domain in the human proteome.The WDR43 gene is located on chromosome 2 and encodes the WDR43 protein containing 677 amino [31].Of note, WDR43 is an essential subunit of multiprotein complexes and is involved in a series of signaling pathways including ubiquitin-proteasome pathway, epigenetic regulation, DNA damage repair, and immune-related pathways [32].For instance, the NOL11-WDR43-Cirhin protein complex is necessary for mitotic chromosome segregation [33].Intriguingly, several bioinformatics analysis identified WDR43 as a crucial oncogene contributing to the development of colorectal/lung cancer via promoting the migration and proliferation of cancer cells through GEO and The Cancer Genome Atlas (TCGA) database.Mechanistically, c-MYC/WDR43/MDM2 mediated p53 degradation, and cyclin-dependent kinase 2 were involved in the underlyng mechanism [34][35][36].However, the role of WDR43 in PAH remains uninvestigated.. Similarly, the imbalance of proliferation and apoptosis in pulmonary artery smooth muscle cells (PASMCs) was also the key   characteristic in pulmonary hypertension [37].Therefore, we speculate WDR43 might contribute to PASMCs proliferation and migration, then leading to the pulmonary artery remodeling.GNL2, the G protein nucleolar 2, was found essential for cell growth and development through participating in the cell-cycle regulation pathway [38].GNL2 acts as a checkpoint for ribosome export, and it plays a vital role in facilitating ribosomal biogenesis and protein synthesis [39].GNL2 was found to play a critical role in the RNA metabolic network and was associated with proliferation [40].Increased expression of GNL2 was correlated with poor prognosis in ovarian cancer patients with 1p34.3 amplifications [41].Results from another scRNA-seq data of periodontitis revealed GNL2 was upregulated in T cells [42].While the role of GNL2 in PAH and its potential underlying mechanisms needs further exploration.In combination with the KEGG analysis in the current study, GNL2 might participate in the underlying mechanism of PH through the influence on the biosome biogenesis and cell cycle.
Nowadays, high-throughput RNA sequencing has been widely used to explore novel mechanisms of PAH [6,43].Especially, with the rapid development of singlecell sequencing, integrated bioinformatics analysis of microarray transcriptome and scRNA-seq, a newly-rising research method, has attracted researchers' attention lately [44].A recent study indicated hpgd was a key gene in pulmonary artery endothelial cells (PAECs) using scRNA-seq data from PAECs of control and PAH rodents [45].There still remains largely unknown on the mechanism of PAH through integrated bioinformatics analysis.The current research might provide a novel insight into the pathogenesis of PAH.

Conclusion
In summary, we performed an integrated bioinformatics analysis of single-cell sequencing andmicroarray transcriptome.Multi-step analysis suggested that WDR43 and GNL2 were increased in PAH lung tissues and they were identified as hub genes in the pathogenesis of PAH.Our results highlight WDR43 and GNL2 as potential biomarkers and pharmacological therapeutic targets for PAH.

Fig. 1 Fig. 2
Fig. 1 Single-cell RNA sequencing analysis of GSE210148 in PAH.A The features, counts, and mitochondrial gene percentage of each sample.B Correlation between genes and counts in each sample.C HVGs were colored in red, and the top 10 HVGs were labeled.D PCs selection using JackStraw function.E Heatmap of top 10 DEGs in each cluster.The top 10 DEGs were labeled in yellow color

Fig. 3
Fig.3The top 5 upregulated and top 5 downregulated DEGs in the PAH group compared with the control group using the jjVolcano map

Fig. 4
Fig. 4 DEGs of lung tissue from GSE113439 dataset.A Volcano plot of DEGs with |log 2 FC|> 0.856 and adjusted p value < 0.05.Upregulated and downregulated genes were colored by red and blue, respectively.B Heatmap displaying the top 50 DEGs of GSE113439.C KEGG of DEGs in GSE113439.Dot blot of the top 10 CC (D), BP (E), and MF (F) pathways of GO in GSE113439.The size and color of dots represent the count of genes and adjusted p value in the selected pathway

Fig. 5
Fig. 5 Dysregulated immune cells infiltration in PAH lungs.A The box plot of immune cells abundance in control and PAH group.B Heatmap displaying the abundance of immune cells in each sample of lung tissues in control and PAH patients

Fig. 9
Fig.9 Hub genes of GSE113439 and hub genes of each cluster from GSE210248.A Selected hub genes in module 1 of GSE113439 using MCODE pulg-in.The Top 20 hub genes in CD8+ T cells (B), chondrocytes (C), endothelial cells (D), epithelial cells (E), fibroblasts (F), macrophages (G), monocytes (H), and NK cells (I) identified by cytohubba plug-in according to nodes' score by MCC method from GSE210248

Fig. 10
Fig. 10 Veen's diagram showing the common hub genes from GSE210248 and GSE113439.A Veen's diagram showing the common hub genes from GSE113439, CD8+ T cells, chondrocytes, and endothelial cells from GSE210248.B Veen's diagram showing the common hub genes from GSE113439, epithelial cells, fibroblasts, and macrophages from GSE210248.C Veen's diagram showing the common hub genes from GSE113439, monocytes and NK cells from GSE210248

Fig. 11
Fig. 11 The expression of selected hub genes.The expression of WDR43 (A) and GNL2 (B) in lung samples of 11 healthy control and 15 PAH patients in GSE113439.**p < 0.01

Table 1
Overview of the enrolled datasets in the current study cells from the PAH group.The data of GSE113439 included fresh frozen lung samples from the recipients' organs of 15 PAH patients and 11 normal lung samples obtained from tissue flanking lung cancer resections.

Table 2
Cell numbers in each cluster

Table 3
The top 10 DEGs in each cell cluster between control and PAH group

Table 4
Common DEGs from each cluster of GSE210248 and GSE113439

Table 5
The centralities and MCODE score of candidate genes evaluated by CentiScape and CytoHubba plug-in